7 research outputs found

    Subgroup discovery for structured target concepts

    Get PDF
    The main object of study in this thesis is subgroup discovery, a theoretical framework for finding subgroups in data—i.e., named sub-populations— whose behaviour with respect to a specified target concept is exceptional when compared to the rest of the dataset. This is a powerful tool that conveys crucial information to a human audience, but despite past advances has been limited to simple target concepts. In this work we propose algorithms that bring this framework to novel application domains. We introduce the concept of representative subgroups, which we use not only to ensure the fairness of a sub-population with regard to a sensitive trait, such as race or gender, but also to go beyond known trends in the data. For entities with additional relational information that can be encoded as a graph, we introduce a novel measure of robust connectedness which improves on established alternative measures of density; we then provide a method that uses this measure to discover which named sub-populations are more well-connected. Our contributions within subgroup discovery crescent with the introduction of kernelised subgroup discovery: a novel framework that enables the discovery of subgroups on i.i.d. target concepts with virtually any kind of structure. Importantly, our framework additionally provides a concrete and efficient tool that works out-of-the-box without any modification, apart from specifying the Gramian of a positive definite kernel. To use within kernelised subgroup discovery, but also on any other kind of kernel method, we additionally introduce a novel random walk graph kernel. Our kernel allows the fine tuning of the alignment between the vertices of the two compared graphs, during the count of the random walks, while we also propose meaningful structure-aware vertex labels to utilise this new capability. With these contributions we thoroughly extend the applicability of subgroup discovery and ultimately re-define it as a kernel method.Der Hauptgegenstand dieser Arbeit ist die Subgruppenentdeckung (Subgroup Discovery), ein theoretischer Rahmen fĂŒr das Auffinden von Subgruppen in Daten—d. h. benannte Teilpopulationen—deren Verhalten in Bezug auf ein bestimmtes Targetkonzept im Vergleich zum Rest des Datensatzes außergewöhnlich ist. Es handelt sich hierbei um ein leistungsfĂ€higes Instrument, das einem menschlichen Publikum wichtige Informationen vermittelt. Allerdings ist es trotz bisherigen Fortschritte auf einfache Targetkonzepte beschrĂ€nkt. In dieser Arbeit schlagen wir Algorithmen vor, die diesen Rahmen auf neuartige Anwendungsbereiche ĂŒbertragen. Wir fĂŒhren das Konzept der reprĂ€sentativen Untergruppen ein, mit dem wir nicht nur die Fairness einer Teilpopulation in Bezug auf ein sensibles Merkmal wie Rasse oder Geschlecht sicherstellen, sondern auch ĂŒber bekannte Trends in den Daten hinausgehen können. FĂŒr EntitĂ€ten mit zusĂ€tzlicher relationalen Information, die als Graph kodiert werden kann, fĂŒhren wir ein neuartiges Maß fĂŒr robuste Verbundenheit ein, das die etablierten alternativen Dichtemaße verbessert; anschließend stellen wir eine Methode bereit, die dieses Maß verwendet, um herauszufinden, welche benannte Teilpopulationen besser verbunden sind. Unsere BeitrĂ€ge in diesem Rahmen gipfeln in der EinfĂŒhrung der kernelisierten Subgruppenentdeckung: ein neuartiger Rahmen, der die Entdeckung von Subgruppen fĂŒr u.i.v. Targetkonzepten mit praktisch jeder Art von Struktur ermöglicht. Wichtigerweise, unser Rahmen bereitstellt zusĂ€tzlich ein konkretes und effizientes Werkzeug, das ohne jegliche Modifikation funktioniert, abgesehen von der Angabe des Gramian eines positiv definitiven Kernels. FĂŒr den Einsatz innerhalb der kernelisierten Subgruppentdeckung, aber auch fĂŒr jede andere Art von Kernel-Methode, fĂŒhren wir zusĂ€tzlich einen neuartigen Random-Walk-Graph-Kernel ein. Unser Kernel ermöglicht die Feinabstimmung der Ausrichtung zwischen den Eckpunkten der beiden unter-Vergleich-gestelltenen Graphen wĂ€hrend der ZĂ€hlung der Random Walks, wĂ€hrend wir auch sinnvolle strukturbewusste Vertex-Labels vorschlagen, um diese neue FĂ€higkeit zu nutzen. Mit diesen BeitrĂ€gen erweitern wir die Anwendbarkeit der Subgruppentdeckung grĂŒndlich und definieren wir sie im Endeffekt als Kernel-Methode neu

    SUSAN: The Structural Similarity Random Walk Kernel

    Get PDF
    Random walk kernels are a very flexible family of graph kernels, in which we can incorporate edge and vertex similarities through positive definite kernels. In this work we study the particular case within this family in which the vertex kernel has bounded support. We motivate this property as the configurable flexibility in terms of vertex alignment between the two graphs on which the walk is performed. We study several fast and intuitive ways to derive structurally aware labels and combine them with such a vertex kernel, which in turn is incorporated in the random walk kernel. We provide a fast algorithm to compute the resulting random walk kernel and we give precise bounds on its computational complexity. We show that this complexity always remains upper bounded by that of alternative methods in the literature and study conditions under which this advantage can be significantly higher. We evaluate the resulting configurations on their predictive performance on several families of graphs and show significant improvements against the vanilla random walk kernel and other competing algorithms. Read More: https://epubs.siam.org/doi/abs/10.1137/1.9781611976700.3

    From Sets of Good Redescriptions to Good Sets of Redescriptions

    Get PDF
    International audienceRedescription mining aims at finding pairs of queries over data variables that describe roughly the same set of observations. These redescriptions can be used to obtain different views on the same set of entities. So far, redescription mining methods have aimed at listing all redescriptions supported by the data. Such an approach can result in many redundant redescriptions and hinder the user's ability to understand the overall characteristics of the data. In this work, we present an approach to find a good set of redescriptions, instead of finding a set of good redescriptions. That is, we present a way to remove the redundant redescriptions from a given set of redescriptions. We measure the redundancy using a framework inspired by the subjective interestingness based on maximum-entropy distributions as proposed by De Bie in 2011. Redescriptions, however, raise their unique requirements on the framework, and our solution differs significantly from the existing ones. Notably, our approach can handle disjunctions and conjunctions in the queries, whereas the existing approaches are limited only to conjunctive queries. The framework also reduces the redundancy in the redescription mining results, as we show in our empirical evaluation

    Naming the most Anomalous Cluster in Hilbert Space for Structures with Attribute Information

    Get PDF
    We consider datasets consisting of arbitrarily structured en- tities (e.g., molecules, sequences, graphs, etc) whose sim- ilarity can be assessed with a reproducing kernel (or a family thereof). These entities are assumed to additionally have a set of named attributes (e.g.: number_of_atoms, stock_price, etc). These attributes can be used to classify the structured entities in discrete sets (e.g., ‘number_of_atoms < 3’, ‘stock_price ≀ 100’, etc) and can effectively serve as Boolean predicates. Our goal is to use this side-information to provide named kernel-based anomaly detection. To this end, we propose a method which is able to find among all possible entity subsets that can be described as a conjunction of the available predicates either a) the optimal cluster within the Reproducing Kernel Hilbert Space, or b) the most anomalous subset within the same space. Our method employs combinatorial optimisation of an adaptation of the Maximum-Mean-Discrepancy measure that captures the above intuition. Additionally, we propose a cri- terion to select the optimal one out of a family of kernels in a way that preserves the available side-information. Finally, we provide several real world datasets that demonstrate the usefulness of our proposed method

    Discovering Robustly Connected Subgraphs with Simple Descriptions

    Get PDF
    We study the problem of discovering robustly connected subgraphs that have simple descriptions. Our aim is, hence, to discover vertex sets which not only a) induce a subgraph that is difficult to fragment into disconnected components, but also b) can be selected from the entire graph using just a simple conjunctive query on their vertex attributes. Since many subgraphs do not have such a simple logical description, first mining robust subgraphs and post-hoc discovering their description leads to sub-optimal results. Instead, we propose to optimise over describable subgraphs only. To do so efficiently we propose a non-redundant iterative deepening approach, which we equip with a linear-time tight optimistic estimator that allows pruning large parts of the search space. Extensive empirical evaluation shows that our method can handle large real-world graphs, and discovers easily interpretable and meaningful subgraphs

    Efficiently Discovering Locally Exceptional yet Globally Representative Subgroups

    Get PDF
    Subgroup discovery is a local pattern mining technique to find interpretable descriptions of sub-populations that stand out on a given target variable. That is, these sub-populations are exceptional with regard to the global distribution. In this paper we argue that in many applications, such as scientific discovery, subgroups are only useful if they are additionally representative of the global distribution with regard to a control variable. That is, when the distribution of this control variable is the same, or almost the same, as over the whole data. We formalise this objective function and give an efficient algorithm to compute its tight optimistic estimator for the case of a numeric target and a binary control variable. This enables us to use the branch-and-bound framework to efficiently discover the top-kk subgroups that are both exceptional as well as representative. Experimental evaluation on a wide range of datasets shows that with this algorithm we discover meaningful representative patterns and are up to orders of magnitude faster in terms of node evaluations as well as time.Comment: 10 pages, To appear in ICDM1

    Omen: discovering sequential patterns with reliable prediction delays

    Get PDF
    Suppose we are given a discrete-valued time series XX X of observed events and an equally long binary sequence YY Y that indicates whether something of interest happened at that particular point in time. We consider the problem of mining serial episodes, sequential patterns allowing for gaps, from XX X that reliably predict those interesting events. With reliable we mean patterns that not only predict that an interesting event is likely to follow, but in particular that we can also accurately tell how how long until that event will happen. In other words, we are specifically interested in patterns with a highly skewed distribution of delays between pattern occurrences and predicted events. As it is unlikely that a single pattern can explain a complex real-world progress, we are after the smallest, least redundant set of such patterns that together explain the interesting events well. We formally define this problem in terms of the Minimum Description Length principle, by which we identify the best patterns as those that describe the occurrences of interesting events YY Y most succinctly given the data over XX X . As neither discovering the optimal explanation of YY Y given a set of patterns, nor the discovery of optimal pattern set are problems that allow for straightforward optimization, we break the problem in two and propose effective heuristics for both. Through extensive empirical evaluation, we show that both our main method, Omen , and its fast approximation fOmen , work well in practice and both quantitatively and qualitatively beat the state of the art
    corecore